
[UR][L0v2] Migrate discrete buffer through host when P2P is not accessible #22010

Open
ldorau wants to merge 3 commits into intel:sycl from ldorau:URL0_Migrate_discrete_buffer_through_host_when_P2P_is_not_accessible

ldorau commented May 13, 2026

When a buffer on a discrete GPU needs to be accessed from a different
device and P2P access is not enabled, migrate the data through a USM
HOST staging buffer instead of returning UR_RESULT_ERROR_UNSUPPORTED_FEATURE.

The migration uses a two-step copy:

  1. Synchronous device->host copy using the source device's own command
    list (the destination device cannot reach source device memory
    without P2P).
  2. Async host->device copy enqueued on the caller's command list (host
    memory is accessible by all devices, so this is safe).

Before the device->host copy, any pending operations on the caller's
command list are ordered and drained via zeCommandListAppendWaitOnEvents
+ zeCommandListHostSynchronize, ensuring prior kernel writes to the
source buffer are visible. A fully synchronous fallback is used when
no command list is available (e.g. urMemGetNativeHandle).
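
The control flow described above (drain, synchronous device->host copy, asynchronous host->device copy, synchronous fallback) can be sketched with a host-only model. This is an illustration under stated assumptions, not the adapter code: `SimCommandList`, `Bytes` and `migrateThroughHost` are hypothetical names, a "device" is just a byte vector, and no Level Zero call is made. Wait events are left out of the model, so the drain is reduced to a plain synchronize.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical host-only model: device memory is a byte vector and a
// command list is a queue of deferred operations. No Level Zero calls.
using Bytes = std::vector<char>;

struct SimCommandList {
  std::vector<std::function<void()>> pending;

  void append(std::function<void()> op) { pending.push_back(std::move(op)); }

  // Stand-in for a host-side synchronize: run all queued work in order.
  void hostSynchronize() {
    std::vector<std::function<void()>> work;
    work.swap(pending);
    for (auto &op : work)
      op();
  }
};

// Migrates `src` to `dst` through a host staging buffer.
// `callerList` may be null, which models the fully synchronous fallback.
// Returns the staging buffer so the caller can keep it alive until the
// deferred host->device copy has run.
std::shared_ptr<Bytes> migrateThroughHost(Bytes &src, SimCommandList &srcList,
                                          Bytes &dst,
                                          SimCommandList *callerList) {
  assert(src.size() == dst.size()); // both describe the same buffer

  // Drain pending work on the caller's list so that earlier writes to
  // `src` are visible before it is read.
  if (callerList != nullptr)
    callerList->hostSynchronize();

  // Step 1: synchronous device->host copy on the source device's list.
  auto staging = std::make_shared<Bytes>(src.size());
  srcList.append([&src, staging] { *staging = src; });
  srcList.hostSynchronize();

  // Step 2: host->device copy.
  auto hostToDevice = [&dst, staging] { dst = *staging; };
  if (callerList != nullptr)
    callerList->append(hostToDevice); // deferred until the next synchronize
  else
    hostToDevice(); // fallback: copy immediately
  return staging;
}
```

In this model the destination only changes once the caller's list is synchronized, which mirrors the asynchronous second step; passing a null list makes both steps complete before the function returns.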

Only one staging buffer is kept alive at a time: it is released at the
start of the next migration after zeCommandListHostSynchronize confirms
the previous async copy has completed.
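
The single-staging-buffer lifetime rule can be illustrated in the same host-only style. This is a minimal sketch under assumptions: `StagingSlot`, `acquire` and `markInFlight` are hypothetical names, and `SimCommandList::hostSynchronize` merely stands in for zeCommandListHostSynchronize by running the queued work.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

using Bytes = std::vector<char>;

// Hypothetical command-list model: queued work runs on synchronize.
struct SimCommandList {
  std::vector<std::function<void()>> pending;

  void append(std::function<void()> op) { pending.push_back(std::move(op)); }

  void hostSynchronize() {
    std::vector<std::function<void()>> work;
    work.swap(pending);
    for (auto &op : work)
      op();
  }
};

// Hypothetical holder for the one staging buffer that may be in flight.
class StagingSlot {
public:
  // Called at the start of every migration. If a previous staging buffer
  // exists, wait for the copy that reads it, then release it, and only
  // then allocate the new one.
  Bytes &acquire(std::size_t size) {
    if (current_) {
      if (lastList_ != nullptr)
        lastList_->hostSynchronize(); // previous async copy is now done
      current_.reset();               // safe: nothing reads it any more
    }
    current_ = std::make_unique<Bytes>(size);
    lastList_ = nullptr;
    return *current_;
  }

  // Records the list on which the copy out of the current buffer was queued.
  void markInFlight(SimCommandList &list) { lastList_ = &list; }

  bool holdsBuffer() const { return current_ != nullptr; }

private:
  std::unique_ptr<Bytes> current_;
  SimCommandList *lastList_ = nullptr;
};
```

The design choice shown here is that release is deferred to the next `acquire` rather than tied to a completion callback, so at most one staging buffer exists at any time at the cost of one extra synchronize per migration.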

A new ensureDeviceAlloc helper allocates the destination device buffer
without the activeAllocationDevice side-effect of allocateOnDevice,
so the active-device state is only updated after the async copy is
successfully enqueued.
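
The point of separating allocation from the active-device update is that a failed enqueue leaves the buffer state unchanged. A minimal sketch of that ordering follows, assuming a simplified buffer class: `SimBuffer`, `migrateTo` and the integer device ids are hypothetical, and having `allocateOnDevice` delegate to `ensureDeviceAlloc` is an assumption of this model, not something the description states.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <vector>

using Bytes = std::vector<char>;

// Hypothetical, simplified buffer with one allocation per device id.
class SimBuffer {
public:
  explicit SimBuffer(std::size_t size) : size_(size) {}

  // Allocates on `device` if needed. Does not touch the active device.
  Bytes &ensureDeviceAlloc(int device) {
    auto it = allocations_.find(device);
    if (it == allocations_.end())
      it = allocations_.emplace(device, Bytes(size_)).first;
    return it->second;
  }

  // Allocates and, as a side effect, marks `device` as active.
  Bytes &allocateOnDevice(int device) {
    Bytes &alloc = ensureDeviceAlloc(device);
    activeDevice_ = device;
    return alloc;
  }

  // `enqueueCopy` stands in for enqueueing the async copy; it may throw.
  template <typename EnqueueFn>
  void migrateTo(int device, EnqueueFn enqueueCopy) {
    Bytes &dst = ensureDeviceAlloc(device);
    enqueueCopy(dst);       // on failure the active device is unchanged
    activeDevice_ = device; // commit only after a successful enqueue
  }

  int activeDevice() const { return activeDevice_; }

private:
  std::size_t size_;
  int activeDevice_ = -1; // -1: no device is active yet
  std::map<int, Bytes> allocations_;
};
```

If `migrateTo` called `allocateOnDevice` instead, an exception from the enqueue step would leave the buffer pointing at a device whose allocation never received the data.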

Fixes: #22007
Fixes: #22008


ldorau commented May 14, 2026

Please review @intel/unified-runtime-reviewers-level-zero

// Migrate buffer through the host: copy from the current device to a
// temporary host buffer, then from host to the target device.
auto bufferSize = getSize();
std::vector<char> hostBuf(bufferSize);
Contributor

nit: it may be worth considering a USM allocation instead of a heap allocation, as in line 100

Contributor Author

Done

Comment on lines +369 to +372
for (uint32_t i = 0; i < waitListView.num; i++) {
  ZE2UR_CALL_THROWS(zeEventHostSynchronize,
                    (waitListView.handles[i], UINT64_MAX));
}
Contributor

I don't think this will work. The operation also needs to be ordered with respect to the command list itself, so something like this would be better:

  if (numWaitEvents > 0) {
    ZE2UR_CALL(zeCommandListAppendWaitOnEvents,
               (zeCommandList.get(), numWaitEvents, pWaitEvents));
  }
  ZE2UR_CALL(zeCommandListHostSynchronize, (zeCommandList.get(), UINT64_MAX));

Contributor Author

Done

auto bufferSize = getSize();
std::vector<char> hostBuf(bufferSize);

UR_CALL_THROWS(synchronousZeCopy(hContext, activeAllocationDevice,
Contributor

I don't like the fact that this is synchronous. Can you explore what it would take to make it async? I think we'd need to keep the allocation somewhere.

Contributor Author

Changed. Is it OK now?

@ldorau ldorau requested review from mateuszpn and pbalcer May 15, 2026 11:49

ldorau commented May 15, 2026

@mateuszpn @pbalcer re-review please

ldorau commented May 18, 2026

@mateuszpn @pbalcer re-review please

@ldorau ldorau force-pushed the URL0_Migrate_discrete_buffer_through_host_when_P2P_is_not_accessible branch 2 times, most recently from 9727548 to 1e9d552 Compare May 19, 2026 07:16
@ldorau ldorau changed the title from "[UR][L0] Migrate discrete buffer through host when P2P is not accessible" to "[UR][L0v2] Migrate discrete buffer through host when P2P is not accessible" May 19, 2026
…sible

When a buffer on a discrete GPU needs to be accessed from a different
device and P2P access is not enabled, migrate the data through a USM
HOST staging buffer instead of returning UR_RESULT_ERROR_UNSUPPORTED_FEATURE.

The migration uses a two-step copy:
1. Synchronous device->host copy using the source device's own command
   list (the destination device cannot reach source device memory
   without P2P).
2. Async host->device copy enqueued on the caller's command list (host
   memory is accessible by all devices, so this is safe).

Before the device->host copy, any pending operations on the caller's
command list are ordered and drained via zeCommandListAppendWaitOnEvents
+ zeCommandListHostSynchronize, ensuring prior kernel writes to the
source buffer are visible.  A fully synchronous fallback is used when
no command list is available (e.g. urMemGetNativeHandle).

Only one staging buffer is kept alive at a time: it is released at the
start of the next migration after zeCommandListHostSynchronize confirms
the previous async copy has completed.

A new ensureDeviceAlloc helper allocates the destination device buffer
without the activeAllocationDevice side-effect of allocateOnDevice,
so the active-device state is only updated after the async copy is
successfully enqueued.

Fixes: intel#22007
Fixes: intel#22008

Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
@ldorau ldorau requested a review from a team as a code owner May 19, 2026 08:57
ldorau added 2 commits May 19, 2026 09:59
Add four conformance tests exercising discrete buffers accessed from
two different device queues when P2P access is not available.

Tests covering the async migration path (cmdList != nullptr, triggered
by urEnqueueMem* operations):
- AsyncFillThenReadOnSecondQueueWithWait: fills a buffer on queues[0]
  and reads it on queues[1] using an explicit event dependency.
- PingPongFillBetweenTwoDeviceQueues: alternates fills between queues[0]
  and queues[1], each read on the opposite queue using event dependencies.
- ChainedAsyncOpsAcrossQueuesWithEvents: chains fill, blocking write,
  and read across two queues using cross-queue events.

Test covering the synchronous fallback path (cmdList == nullptr,
triggered by urMemGetNativeHandle):
- SyncFallbackMigrationViaNativeHandle: fills the buffer on device 0,
  calls urMemGetNativeHandle for device 1 to trigger synchronous
  host-staged migration, then verifies the data on device 1.

All tests add an explicit queues.size() < 2 guard (GTEST_SKIP) in case
the fixture minimum-device requirement changes, and cross-queue ordering
is expressed with events throughout to properly exercise the async
migration path.

A dedicated L0 v2 adapter runner (discrete_buffer_host_migration.cpp)
reuses the conformance test source under UR_LOADER_USE_LEVEL_ZERO_V2.

Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
The test was intermittently failing on CI hardware because the queue
create + USM fill + urQueueFinish sequence before the memory measurement
introduced a multi-millisecond time window.  During that window, async
driver cleanup from earlier P2P tests (which can fail to evict peer
residency via zeContextEvictMemory) or concurrent GPU workloads on
shared CI machines could change devices[1]'s GLOBAL_MEM_FREE reading
enough to trigger the assertion.

The queue/fill/finish operations are not needed to test the residency
property: zeContextMakeMemoryResident is invoked at urUSMDeviceAlloc
time, so measuring immediately after the allocation captures any
peer-residency side-effects without a blocking GPU operation in
between.  Remove those operations to keep the measurement window as
short as possible, matching the pattern already used in
allocationInitiallyAbsentOnPeer.

Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>

ldorau commented May 20, 2026

@mateuszpn @pbalcer re-review please
